Objective

This project should allow you to apply the information you’ve learned in the course to a new dataset. While the structure of the final project will be more of a research project, you can use this knowledge to appropriately answer questions in all fields, along with the practical skills of writing a report that others can read. The dataset must be related to language or language processing in some way. You must use an analysis we learned in class.

Instructions

The final document should be a knitted HTML/PDF/Word document from a Markdown file. You will turn in the knitted document. Be sure to spell and grammar check your work! The following sections should be included:

Introduction

Introduce your research topic. What is the background knowledge that someone would need to understand the field or area that you have decided to investigate? In this section, you should include sources that help explain the background area and cite them in APA style. 5-10 articles across the paper would be appropriate - be sure to include these! They are part of the grade!

Sexual abuse is one of the most prevalent crimes against children. In the pre-internet era, sexual predators would linger around schools and lure innocent children with chocolates or toys to gain their trust. In the modern era, the luring happens through social media: online chatting platforms are now used by sexual predators to draw children into sexual conversations.

Our project aims to develop a model that identifies sexual predators using a dataset of historical chats with users identified as sexual predators. The dataset was acquired from (“PAN @ CLEF 2012 - Sexual Predator Identification” n.d.). PAN2012 is a sexual predator identification competition whose dataset contains conversations between young children and sexual offenders. There are two goals associated with the PAN2012 Sexual Predator Identification task: one is to classify the sexual predators using supervised machine learning, and the other is to identify all chat lines that contain inappropriate sexual content. For this project, we will focus only on the classification task.

(Inches and Crestani, n.d.) contains an overview of the dataset and the tasks mentioned above. (Liu, Suen, and Ormandjieva 2017) used an LSTM with sentence vectors to classify predators and identify which conversations are suspicious. Their model performance was outstanding, with a precision of 100% and a recall of 81%, the top-ranking performance in the SPI competition. (Parapar, Losada, and Barreiro, n.d.) addressed this problem as a supervised machine learning task using different sets of features, such as psychological and chat-based features. (Morris and Hirst, n.d.) used SVM-based classification with behavioral features extracted from the chats. (Villatoro-tello et al., n.d.) focused on extracting misbehaving chats from the corpus.

Hypothesis / Problem Statement

What is the data that you are using for your project? What is your hypothesis as to the outcome of the analysis? Why is the problem important for us to study or answer?

We will use the PAN2012 training data for our project. We believe that sexual offenders engage in long conversations with children before initiating intimate conversations. Thus, we will have to remove many of the common words that occur in normal chats in order to train the model to recognize suspicious conversations. A well-trained model would be capable of detecting suspicious chats from potential sexual offenders, providing an extra layer of safety for children who heavily use social media platforms.

Statistical Analysis Plan

Explain the statistical analysis that you are using - you can assume some statistical background, but not to the specific design you are mentioning. For example, the person would know what a mean is, but not the more complex analyses.

We plan to convert the words into TF-IDF (term frequency-inverse document frequency) weights so that infrequent words, especially those used by the predators, are weighted highly. These TF-IDF weights will be fed to supervised machine learning models such as logistic regression, support vector machines, and random forests.

Method - Data - Variables

Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data? Identify what the independent and dependent variables are for the analysis. How do these independent and dependent variables fit into the analyses you selected?

As mentioned above, the PAN2012 training data collected from (“PAN @ CLEF 2012 - Sexual Predator Identification” n.d.) will be used for our analysis. Due to limited computational capacity, we will not use the PAN2012 test corpus, as it is twice the size of the training data. Instead, we will split the existing data into train and test sets. The independent variables are the TF-IDF weights extracted from the chats, and the dependent variable is binary: 1 for a sexual predator and 0 for a normal user. Since frequently used words appear in both normal users' and sexual offenders' chats, the TF-IDF representation gives higher weights to infrequent terms, including those characteristic of the offenders.

Statistical Analysis Results

Analyze the data given your statistical plan. Report the appropriate statistics for that analysis (see lecture notes). Include figures! Include the R-chunks so we can see the analyses you ran and output from the study. Note what you are doing in each step.

  • Getting the required packages
library(reticulate)
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import contractions
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
  • Getting the data
spi = pd.read_csv('spi_nourl.csv')
  • Exploring the dataset
spi.shape
## (903604, 7)
spi.head()
##                     conversation_id  message_line  ... author_id sexual_predator
## 0  e621da5de598c9321a1d505ea95e6a2d             1  ...         0               0
## 1  e621da5de598c9321a1d505ea95e6a2d             2  ...         0               0
## 2  e621da5de598c9321a1d505ea95e6a2d             3  ...         0               0
## 3  e621da5de598c9321a1d505ea95e6a2d             4  ...         0               0
## 4  e621da5de598c9321a1d505ea95e6a2d             5  ...         0               0
## 
## [5 rows x 7 columns]
sp = spi[spi['sexual_predator'] == 1]
f"There are {sp['author'].nunique()} sexual predators"
## 'There are 142 sexual predators'
f"There are a total of {spi['conversation_id'].nunique()} conversations"
## 'There are a total of 66927 conversations'
f"There are a total of {spi['author'].nunique()} users"
## 'There are a total of 97689 users'
  • Wordclouds
# generating word frequencies from the .str accessor of a chat-message column
def gen_freq(text):
    # list of words
    word_list = []
    
    # .str.split() returns a Series of token lists; collect all tokens
    for words in text.split():
        word_list.extend(words)
        
    # create word frequencies using word_list
    word_freq = pd.Series(word_list).value_counts()
    
    return word_freq

word_freq_normal = gen_freq(spi[spi['sexual_predator'] == 0]['chat_message'].str)
word_freq_predator = gen_freq(spi[spi['sexual_predator'] == 1]['chat_message'].str)
stop_words = stopwords.words('english')

Wordcloud of normal users

wc_normal = WordCloud(width = 400, height = 330, max_words = 100, 
                      background_color = 'white', stopwords = stop_words).generate_from_frequencies(word_freq_normal)

plt.figure(figsize = (12, 8))
plt.imshow(wc_normal, interpolation = 'bilinear')
plt.axis('off')
## (-0.5, 399.5, 329.5, -0.5)
plt.show()

We see a lot of words that are part of normal chat conversations, such as “like”. However, we also see a lot of stopwords, because capitalized stopwords are not in nltk's stop_words list.

'I' in stop_words
## False

Wordclouds of sexual predators

wc_predator = WordCloud(width = 400, height = 330, max_words = 100, 
                        background_color = 'white', stopwords = stop_words).generate_from_frequencies(word_freq_predator)

plt.figure(figsize = (12, 8))
plt.imshow(wc_predator, interpolation = 'bilinear')
plt.axis('off')
## (-0.5, 399.5, 329.5, -0.5)
plt.show()

As expected, the most frequent words are highly similar between normal users and sexual predators.

Cleaning the text and removing stopwords should give us better results. Also, many of the conversations contain just stopwords, for example “whats up?”. After we apply stemming, “whats” becomes “what”, and thus both words will be removed. So we will keep only conversations with at least two words.
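As a quick check of the stemming step (using nltk's PorterStemmer, already imported above):

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# "whats" loses its trailing "s", so it then matches the stopword "what"
print(ps.stem("whats"))  # prints "what"
# the Porter stemmer maps a final "y" to "i", which is why the wordclouds
# later show "babi" rather than "baby"
print(ps.stem("baby"))   # prints "babi"
```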

  • Cleaning the data

Instead of the regular stopwords, we will use the SMART stopword list, which contains general conversational words.

# getting the smart stopwords; the with-block closes the file automatically
with open('STOPWORDS.txt') as f:
    smart_stopwords = [word.rstrip() for word in f.readlines()]
print(f'Length of regular stopwords is {len(stop_words)}')
## Length of regular stopwords is 179
print(f'Length of smart stopwords is {len(smart_stopwords)}')
## Length of smart stopwords is 571
ps = PorterStemmer()
pattern = r"[$&+,:;=_?@#|\[\]{}'<>.^*()%!-]"
def clean_text(text):
    text = ' '.join(re.sub(pattern, '', text).strip().split()) # remove punctuations
    text = ' '.join([word for word in text.split() if word.isalpha()]) # only words
    #text = BeautifulSoup(text).get_text() #html
    text = text.lower() #lower case
    text = contractions.fix(text) #contractions
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore') #symbols
    text = ' '.join(word for word in text.split() if word not in smart_stopwords) # stopwords
    text = ' '.join([ps.stem(word) for word in text.split()]) #stem
    return text
spi['chat_message'] = spi['chat_message'].apply(clean_text)
spi.head()
##                     conversation_id  message_line  ... author_id sexual_predator
## 0  e621da5de598c9321a1d505ea95e6a2d             1  ...         0               0
## 1  e621da5de598c9321a1d505ea95e6a2d             2  ...         0               0
## 2  e621da5de598c9321a1d505ea95e6a2d             3  ...         0               0
## 3  e621da5de598c9321a1d505ea95e6a2d             4  ...         0               0
## 4  e621da5de598c9321a1d505ea95e6a2d             5  ...         0               0
## 
## [5 rows x 7 columns]

As expected, some conversations that contained only stopwords now have zero words.

Filtering to conversations with at least two words

spi_sub = spi[spi['chat_message'].apply(lambda x: len(x.split()) > 1)]
spi_sub.index = range(0, spi_sub.shape[0])
spi_sub.shape
## (412898, 7)

Recreating the wordclouds

Normal users

word_freq_normal = gen_freq(spi_sub[spi_sub['sexual_predator'] == 0]['chat_message'].str)
word_freq_predator = gen_freq(spi_sub[spi_sub['sexual_predator'] == 1]['chat_message'].str)

wc_normal = WordCloud(width = 400, height = 330, max_words = 100, 
                      background_color = 'white').generate_from_frequencies(word_freq_normal)

plt.figure(figsize = (12, 8))
plt.imshow(wc_normal, interpolation = 'bilinear')
plt.axis('off')
## (-0.5, 399.5, 329.5, -0.5)
plt.show()

Sexual Predator

wc_predator = WordCloud(width = 400, height = 330, max_words = 100, 
                      background_color = 'white').generate_from_frequencies(word_freq_predator)

plt.figure(figsize = (12, 8))
plt.imshow(wc_predator, interpolation = 'bilinear')
plt.axis('off')
## (-0.5, 399.5, 329.5, -0.5)
plt.show()

We see that words like love, babi, sleep, miss, play, and friend, along with pu–i, se-y, s-x, and s-ck, are among the most frequent words used by the predators.

sns.countplot(x = spi_sub['sexual_predator'])
plt.show()

As we can see, there is a huge class imbalance. This is expected, as in the real world there are far more normal users than sexual predators.

spi_sub['sexual_predator'].value_counts()
## 0    397957
## 1     14941
## Name: sexual_predator, dtype: int64

Machine Learning model

We need to make sure the classes have the same proportions in both sets, so we stratify the split by the class label.

X_train, X_test, y_train, y_test = train_test_split(
    np.array(spi_sub['chat_message'].apply(lambda x: np.str_(x))),
    np.array(spi_sub['sexual_predator']),
    stratify = np.array(spi_sub['sexual_predator']),
    test_size = 0.20, random_state = 100)
# np.unique(..., return_counts = True)[1] is the per-class count array
freq, freq2 = np.unique(y_train, return_counts = True)[1]
freq3, freq4 = np.unique(y_test, return_counts = True)[1]
print(f'The proportion of normal users in training set is {((freq/(freq+ freq2))*100):.2f}% whereas the frequency of sexual predators is {((freq2/(freq+ freq2))*100):.2f}%')
## The proportion of normal users in training set is 96.38% whereas the frequency of sexual predators is 3.62%
print(f'The proportion of normal users in test set is {((freq3/(freq3+ freq4))*100):.2f}% whereas the frequency of sexual predators is {((freq4/(freq3+ freq4))*100):.2f}%')
## The proportion of normal users in test set is 96.38% whereas the frequency of sexual predators is 3.62%
tv = TfidfVectorizer(min_df = 0., max_df = 1., norm = 'l2', use_idf = True, smooth_idf = True)

train_tfidf = tv.fit_transform(X_train)
test_tfidf = tv.transform(X_test)

print(train_tfidf.shape)
## (330318, 93602)
print(test_tfidf.shape)
## (82580, 93602)
  1. Logistic Regression Classifier
log_model = LogisticRegression(penalty = 'l2', solver = 'lbfgs', multi_class = 'ovr', max_iter = 1000, C = 1, random_state = 100)

log_model.fit(train_tfidf, y_train)
## LogisticRegression(C=1, max_iter=1000, multi_class='ovr', random_state=100)

Predicting on the test set

log_predictions = log_model.predict(test_tfidf)
print(classification_report(y_test, log_predictions))
##               precision    recall  f1-score   support
## 
##            0       0.97      1.00      0.98     79592
##            1       0.60      0.08      0.15      2988
## 
##     accuracy                           0.96     82580
##    macro avg       0.78      0.54      0.56     82580
## weighted avg       0.95      0.96      0.95     82580

As expected, this model performed very well on the majority class and weakly on the minority class. Models like SVM that use a hyperplane to separate classes would likely yield similar results.

  2. Support Vector Machines
svm = LinearSVC(penalty = 'l2', C = 1, random_state = 100)
svm.fit(train_tfidf, y_train)
## LinearSVC(C=1, random_state=100)
svm_predictions = svm.predict(test_tfidf)
print(classification_report(y_test, svm_predictions))
##               precision    recall  f1-score   support
## 
##            0       0.97      1.00      0.98     79592
##            1       0.58      0.10      0.17      2988
## 
##     accuracy                           0.96     82580
##    macro avg       0.77      0.55      0.57     82580
## weighted avg       0.95      0.96      0.95     82580

As expected, the SVM model also performs poorly in classifying the minority class.

The class imbalance needs to be handled to get a good model. We could either undersample the majority class or oversample the minority class. Within oversampling, we could duplicate observations of the minority class, or create synthetic samples that are similar but not identical to them. For further work, algorithms like SMOTE (Synthetic Minority Oversampling Technique) could be used to create synthetic samples.
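The simpler of these options, oversampling by duplication, can be sketched with plain numpy (the toy labels below are made up stand-ins for our TF-IDF matrix; SMOTE itself lives in the separate imbalanced-learn package):

```python
import numpy as np

rng = np.random.default_rng(100)

# toy imbalanced data: 95 majority rows, 5 minority rows
X = rng.normal(size = (100, 3))
y = np.array([0] * 95 + [1] * 5)

minority_idx = np.where(y == 1)[0]
majority_count = (y == 0).sum()

# duplicate minority rows (sampling with replacement) until balanced
extra = rng.choice(minority_idx, size = majority_count - len(minority_idx),
                   replace = True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())  # prints 95 95
```

SMOTE goes one step further by interpolating between minority neighbors instead of copying rows, which avoids exact duplicates in the training set.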

Interpret and Discuss

Summarize the results from your study in as plain of language as possible. How does this relate to previous literature? Were the results supportive of your hypotheses? What have we learned from you doing this analysis/study?

We saw that both precision and recall for the minority class, i.e. the sexual predators, are very low for both the logistic regression model and the support vector machine. This means that on future data the models will fail to detect sexual predators. The class imbalance needs to be handled, either through a model that can handle imbalance or through synthetic sampling.
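One low-effort option within sklearn itself is the class_weight='balanced' setting, which reweights each class's contribution to the loss inversely to its frequency (sketched here on synthetic separable data, not the chat corpus):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic imbalanced data: 95 majority points near 0, 5 minority near 3
X = np.vstack([rng.normal(0, 1, (95, 3)), rng.normal(3, 1, (5, 3))])
y = np.array([0] * 95 + [1] * 5)

# 'balanced' scales each class's loss by n_samples / (n_classes * class_count),
# so minority errors cost roughly 19x more here
clf = LogisticRegression(class_weight = 'balanced', max_iter = 1000,
                         random_state = 100)
clf.fit(X, y)
```

On our TF-IDF matrix this would be a drop-in change to the logistic regression fit above, at the cost of trading some majority-class precision for minority-class recall.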

Boosting models like XGBoost or CatBoost, which start with weak learners and gradually improve them, could also help with the class imbalance problem. However, for future work, it would be better to apply SMOTE to the training set to balance the data and then use algorithms like XGBoost to get better results. Algorithms like LSTM could also be used to capture relationships between the words: (Liu, Suen, and Ormandjieva 2017) used an LSTM to perform the classification task and got good results for both classes.

  • Summary

As we expected, the sexual predators use a lot of the same everyday language as normal users, which needed to be removed for the model. When we used wordclouds to view the most common words after text cleaning, we observed some intimate words in the predators' chats. We also saw that the classifiers were unable to identify the sexual predators due to the high imbalance in the data. Such datasets are examples of why accuracy alone should not be relied on, as we had an accuracy of over 96% for both models.

References

Include your references in APA style.

Inches, Giacomo, and Fabio Crestani. n.d. “Overview of the International Sexual Predator Identification Competition at PAN-2012,” 12.

Liu, Dan, Ching Yee Suen, and Olga Ormandjieva. 2017. “A Novel Way of Identifying Cyber Predators.” arXiv:1712.03903 [Cs], December. http://arxiv.org/abs/1712.03903.

Morris, Colin, and Graeme Hirst. n.d. “Identifying Sexual Predators by SVM Classification with Lexical and Behavioral Features,” 12.

“PAN @ CLEF 2012 - Sexual Predator Identification.” n.d. Accessed June 19, 2020. https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html.

Parapar, Javier, David E Losada, and Alvaro Barreiro. n.d. “A Learning-Based Approach for the Identification of Sexual Predators in Chat Logs,” 12.

Villatoro-tello, Esaú, Antonio Juárez-gonzález, Hugo Jair Escalante, Manuel Montes-y-gómez, and Luis Villaseñor-pineda. n.d. A Two-Step Approach for Effective Detection of Misbehaving Users in Chats Notebook for PAN at CLEF 2012.